# LLaVA-NeXT & VideoMME Benchmark Demo

This demo walks through evaluating a vision-language model on the **VideoMME** benchmark, using [LLaVA-Video](https://github.com/LLaVA-VL/LLaVA-NeXT) as the model and comparing uniform frame sampling with Event-Anchored Frame Selection (EFS).

---

## 🛠️ Installation & Setup

### 1. Clone Repository & Set Up Environment

```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support
cd LLaVA-NeXT
pip install -e ".[train]"
```
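
To confirm the editable install worked before moving on, a quick import check can help. The sketch below assumes the install exposes a top-level `llava` package and pulls in `torch` as a dependency; the helper name `missing_packages` is illustrative and not part of LLaVA-NeXT.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names in `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: the editable install provides `llava`, and `torch` is one of its dependencies.
REQUIRED = ["llava", "torch"]

missing = missing_packages(REQUIRED)
print("missing packages:", ", ".join(missing) if missing else "none")
```

After step 2 the same check can be repeated with `flash_attn` and `transformers` added to the list.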

---

### 2. Install flash-attn

*Download the prebuilt wheel that matches your Python, PyTorch, and CUDA versions.*

**Example (Python 3.10, CUDA 12.2, PyTorch 2.1):**
```bash
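# Download the prebuilt wheel for Python 3.10, CUDA 12.2, PyTorch 2.1
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.0/flash_attn-2.6.0+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# Install the downloaded wheel
pip install flash_attn-2.6.0+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# Pin transformers to version 4.45.2
pip install transformers==4.45.2
```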
```
> For other versions, see the official [flash-attn releases](https://github.com/Dao-AILab/flash-attention/releases).
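
The wheel filename encodes the flash-attn, CUDA, PyTorch, C++ ABI, and Python versions. The following is a minimal sketch that assembles a filename in the same pattern as the example above. The function name is hypothetical, and not every combination is actually published, so the result should always be checked against the releases page.

```python
import sys


def flash_attn_wheel_name(flash_version, cuda_version, torch_version,
                          python_version=None, cxx11_abi=False):
    """Build a wheel filename in the pattern of the example above.

    The inputs can be read from a local install, e.g. `torch.__version__`,
    `torch.version.cuda`, and `torch.compiled_with_cxx11_abi()`.
    """
    if python_version is None:
        python_version = sys.version_info[:2]
    # "12.2" -> "cu122"
    cuda_tag = "cu" + "".join(cuda_version.split(".")[:2])
    # "2.1.2" -> "torch2.1" (only major.minor is kept)
    torch_tag = "torch" + ".".join(torch_version.split(".")[:2])
    # (3, 10) -> "cp310", used for both the Python tag and the ABI tag
    py_tag = "cp{}{}".format(*python_version)
    abi_tag = "cxx11abi" + ("TRUE" if cxx11_abi else "FALSE")
    return (f"flash_attn-{flash_version}+{cuda_tag}{torch_tag}{abi_tag}"
            f"-{py_tag}-{py_tag}-linux_x86_64.whl")


print(flash_attn_wheel_name("2.6.0", "12.2", "2.1", python_version=(3, 10)))
```

With the values from the example, this reproduces the filename used in the `wget` command above.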

---

### 3. Download VideoMME Dataset

Follow the instructions in the [Video-MME repository](https://github.com/MME-Benchmarks/Video-MME) to download and prepare the dataset.

---

### 4. Copy Project Files

Copy all files from this demo's directory into the root of the cloned `LLaVA-NeXT` directory, and run the remaining steps from there.

---

### 5. Feature Extraction

Extract DINOv2 features and BLIP2-ITM (image-text matching) scores for the dataset:

```bash
python extract_feature_scores.py --dataset_path "YOUR_DATASET_PATH"
```

---

### 6. Event-Anchored Frame Selection

```bash
python EFS_selected.py --dataset_path "YOUR_DATASET_PATH"
```
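
Steps 5 and 6 can also be chained from a small driver script. This is only a sketch: it assumes frame selection depends on the output of feature extraction, so it stops at the first failing command. The helper names are hypothetical; the script names and the `--dataset_path` flag are taken from the commands above.

```python
import subprocess
import sys


def preprocessing_commands(dataset_path, python=sys.executable):
    """Return the step 5 and step 6 commands as argument lists, in run order."""
    return [
        [python, "extract_feature_scores.py", "--dataset_path", dataset_path],
        [python, "EFS_selected.py", "--dataset_path", dataset_path],
    ]


def run_in_order(commands):
    """Run each command in turn; check=True raises if a command exits non-zero."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Example call (not executed here):
# run_in_order(preprocessing_commands("YOUR_DATASET_PATH"))
```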

---

## 🚀 Inference

> **Tip:** You can skip steps 5 & 6 if using the provided precomputed frame selection file:  
> `videomme_16frames_selected_by_efs.json`

**A. Uniform Sampling**
```bash
python llava_inference_videomme_source.py --dataset_path "YOUR_DATASET_PATH"
```
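
For reference, uniform sampling picks frames at evenly spaced positions across the video. The sketch below shows one common way to compute such indices; it is not taken from `llava_inference_videomme_source.py`, which may differ in detail. The default of 16 frames mirrors the frame count in the provided selection file name, and using it for the uniform baseline is an assumption.

```python
def uniform_frame_indices(total_frames, num_frames=16):
    """Return up to `num_frames` evenly spaced frame indices in [0, total_frames - 1]."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if total_frames <= num_frames:
        # Short video: use every frame.
        return list(range(total_frames))
    if num_frames == 1:
        # A single frame: take the middle of the video.
        return [total_frames // 2]
    last = total_frames - 1
    # Integer arithmetic keeps the first index at 0 and the last at total_frames - 1.
    return [(i * last) // (num_frames - 1) for i in range(num_frames)]


print(uniform_frame_indices(100, 5))
```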

**B. Event-Anchored Frame Selection (EFS Sampling)**
```bash
python llava_inference_videomme_source.py --dataset_path "YOUR_DATASET_PATH"
```

---

## 📊 Evaluation

```bash
python eval_videomme.py --results_file "YOUR_RESULT_JSON_PATH"
```
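
Video-MME questions are multiple choice, so the core metric is accuracy, usually also broken down by video duration. The following sketch illustrates that computation on an in-memory list of records. The field names `pred`, `answer`, and `duration` are hypothetical; the actual format expected by `eval_videomme.py` is not documented here and may differ.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key="duration"):
    """Compute accuracy per group and overall.

    Each record is a dict with hypothetical keys: "pred" (predicted option
    letter), "answer" (ground-truth letter), and `group_key`.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        # Compare option letters case-insensitively, ignoring surrounding whitespace.
        hit = rec["pred"].strip().upper() == rec["answer"].strip().upper()
        for group in (rec[group_key], "overall"):
            total[group] += 1
            correct[group] += int(hit)
    return {group: correct[group] / total[group] for group in total}


# Made-up records, for illustration only.
example = [
    {"pred": "A", "answer": "A", "duration": "short"},
    {"pred": "b", "answer": "B", "duration": "short"},
    {"pred": "C", "answer": "D", "duration": "long"},
    {"pred": "D", "answer": "D", "duration": "long"},
]
print(accuracy_by_group(example))
```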

---

## 📌 Notes

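- The pipeline can be adapted to other Large Vision-Language Models (LVLMs): feature extraction and frame selection (steps 5 and 6) rely on DINOv2 and BLIP2-ITM rather than on LLaVA-Video, so only the inference script should need to change.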
- For more details on the VideoMME dataset and its annotations, refer to the [Video-MME repository](https://github.com/MME-Benchmarks/Video-MME).

---

**Happy benchmarking! 🚩**
